I have a toxic relationship with RStudio
Hi, I am a student (though not for much longer) who has been struggling with RStudio: loving it, and sometimes hating it.
I often run statistical analyses like t-tests, ANOVAs, and linear regressions, and I love making graphs inside RStudio. So I thought it would be a great idea to post my journey here, mostly so that I won’t have to dig through my past files and copy my code whenever I want to run a new analysis.
Sometimes I use RStudio for funny stuff, like drawing national flags, just for fun. So you will probably see those here.
Feel free to leave comments or suggestions. I am still an infant in the fascinating world of R, and I am open to learning new things!
Best, cocoyamo
Reykjavik, 2024
First things first: install and
library
I wish someone had used this analogy when I was learning RStudio for the very first time.
People are constantly saying “oh, you need to install this package,” so I used to install every package every single time I opened RStudio.
Then people will ask you to “call the package out” using a function
called library.
These two things confused me a lot in the beginning (and I was mocked by a TA who was very inconsiderate of people who had never coded before). Hopefully you can find some answers here. And I will not mock you. I swear. I know the struggles.
I now think of this as making a sandwich. Pick whatever sandwich you like; I will make a peanut butter and chocolate sandwich here.
So, imagine you are going to make a peanut butter and chocolate sandwich. You need ingredients, so you go to a nearby grocery store to buy chocolate spread, peanut butter, and toast. (Feel free to switch to any other food you like, and please do not follow my recipe if you are allergic to peanuts or chocolate!)
This step (getting ingredients from a store) is called
install.
Then, you go back to your kitchen and pull these ingredients out of
your reusable bag; this action is similar to the library
function.
Let’s assume these ingredients never run out. (What a heaven!) Then, the next time you want a peanut butter and chocolate sandwich, all you need to do is pull out these ingredients again. There is no need to go back to the store every time; you already have everything in your kitchen!
Therefore, whenever you need a package you have not used
before, you install it first. After that, the only thing you
need to do is use the library function to pull this
magical stuff out of your bag.
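To keep the grocery-store analogy straight, here is the pattern in code. I am using ggplot2 as a stand-in here; any package works the same way:

```r
# The grocery run: download the package from CRAN.
# You only need to do this ONCE per computer.
install.packages("ggplot2")

# Pulling it out of the bag: load the package for use.
# You need to do this once in EVERY new R session.
library(ggplot2)
```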
Okay, I hope you are still with me, and not already in the kitchen looking for your chocolate spread.
Let’s import our data!
note: if this is your first time importing data, and your
data is in an Excel file, then remember to type
install.packages("readxl") in your coding space
first.
install.packages("readxl")
library(readxl)
There are of course tons of ways to import your data, but the following is what I use most often.
I click Import Dataset in the top right corner of
the Environment pane. Then I choose
From Excel (or From Text if I have a .csv
file), and use Browse at the top right to pick
the file I want to import.
You can either double-click the file, or click it once and then click
Open at the bottom right.
The middle part is the Data Preview, where you can see your
data. Don’t import the wrong file! (I’ve done that multiple times,
especially when I am lazy about giving files proper names :P)
Then, look at the bottom right corner; there is a section called
Code Preview.
The code will look something like this. (But not identical! Because (hopefully) we are using different file names.)
library(readxl)
wether_math <- read_excel("~/Documents/MMR/example files/wether_math.xlsx")
View(wether_math)
Now that you have seen the Code Preview section: at the end
of it there is a tiny icon that looks like a clipboard. What do you do? CLICK
IT!
Great Job!!
Now you have copied the code necessary to import your file. Press
Import at the bottom right.
Go back to the coding section (the top left pane
of the interface) and paste the code there
(Ctrl/Command + V).
(Let’s hope RStudio never changes its interface, or else I will have to rewrite this part.)
You can also use head() to see the first few
rows of your data, just to make sure everything went well.
head(wether_math)
## # A tibble: 6 × 3
## id day score
## <dbl> <chr> <dbl>
## 1 1 sunny 9
## 2 2 sunny 8
## 3 3 sunny 7
## 4 4 sunny 8
## 5 5 sunny 8
## 6 6 sunny 8
Or you could just type the name of the data.
wether_math
## # A tibble: 20 × 3
## id day score
## <dbl> <chr> <dbl>
## 1 1 sunny 9
## 2 2 sunny 8
## 3 3 sunny 7
## 4 4 sunny 8
## 5 5 sunny 8
## 6 6 sunny 8
## 7 7 sunny 6
## 8 8 sunny 5
## 9 9 sunny 3
## 10 10 sunny 4
## 11 1 rainy 1
## 12 2 rainy 3
## 13 3 rainy 5
## 14 4 rainy 7
## 15 5 rainy 6
## 16 6 rainy 5
## 17 7 rainy 6
## 18 8 rainy 7
## 19 9 rainy 5
## 20 10 rainy 3
Let us take a peek at this data, so you will have a better grasp of it in our future examples.
This is made-up data. There are 10 imaginary people participating in this imaginary study. They take a math test on both a rainy day and a sunny day, and their scores are recorded. The first column is the participant number, the second column is the weather (either sunny or rainy), and the third column is their score.
There are also data packages for R. We can access their data
by simply installing them. For example, we can install a package called
babynames.
This package contains baby names in the US from 1880 to 2017.
Let us take a look.
install.packages("babynames")
library(babynames)
babynames
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # ℹ 1,924,655 more rows
RStudio opens the data in a separate tab when we use
View(). We can see that there are 5 columns:
year, sex, name, n,
and prop.
Wait, what is prop?
When we run into questions like this, we can type
? followed by a function or dataset name in the Console. In this case, we
type ?babynames.
Once we hit Enter, the Help pane on the right pops
up some information. After reading the explanation of this
dataset, we learn that prop means n divided
by the total number of babies of that sex born in that year.
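The ? trick works the same way for functions as for datasets. A quick sketch:

```r
# Typed in the Console; each opens a page in the Help pane.
?babynames   # documentation for the babynames dataset
?mean        # documentation for the mean() function
help(mean)   # the long-hand version of ?mean
```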
There are also built-in datasets inside the dplyr package. For
example, there is a dataset called starwars.
We can access it with a few lines of code.
install.packages("dplyr")
library(dplyr)
We just called out the dplyr package, where the
starwars dataset is stored.
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
These functions are commonly used for checking data.
View() will open a new tab for reading datasets. Other
handy ones are glimpse(), head(),
tail(), dim(), and slice().
You can also check the dimensions of a dataset with
ncol() and nrow(); both numbers can be
found at once using dim().
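Here is a quick tour of those checking functions on the starwars data (glimpse() and slice() come from dplyr, so it has to be loaded first):

```r
library(dplyr)  # provides glimpse(), slice(), and the starwars data

glimpse(starwars)       # one line per column, with types and first values
head(starwars)          # first 6 rows
tail(starwars, 3)       # last 3 rows
slice(starwars, 10:12)  # rows 10 through 12
dim(starwars)           # rows and columns at once: 87 14
nrow(starwars)          # 87
ncol(starwars)          # 14
```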
I really love tidyverse. It is a game changer. The three packages I use most often in tidyverse are
ggplot2, dplyr, and tidyr. But first we need to know what it is.
Tidyverse contains several packages, making data
transformation and visualization (and my life) easier. Under the
umbrella of tidyverse, we will be able to wrangle data
like a pro.
We will delve deeper into the land of ggplot2 in a
future chapter. Here, we focus on dplyr and
tidyr first.
If you have not used tidyverse before, you need
to install the package first.
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tidyr")
Next, let us not forget to also call out our
packages, which is the job of the library() function.
(In fact, library(tidyverse) already loads both
dplyr and tidyr, so the next two lines are just for emphasis.)
library("tidyverse")
library("dplyr")
library("tidyr")
You are all set.
Before we jump into anything, let us meet a funny-looking friend: the pipe.
It looks like this:
%>%
In my humble opinion, this is one of the greatest analogies of the
century, although the way it looks (%>%) does not strike
me as a real pipe at all.
Let us use an example to see how it works.
The babynames package, which we installed earlier, contains
baby names in the US from 1880 to 2017.
We can take a look at the data.
babynames
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # ℹ 1,924,655 more rows
Now, let’s say we want to know names of babies born in year 2000.
babynames %>%
filter(year == 2000)
## # A tibble: 29,769 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2000 F Emily 25953 0.0130
## 2 2000 F Hannah 23080 0.0116
## 3 2000 F Madison 19967 0.0100
## 4 2000 F Ashley 17997 0.00902
## 5 2000 F Sarah 17697 0.00887
## 6 2000 F Alexis 17629 0.00884
## 7 2000 F Samantha 17266 0.00866
## 8 2000 F Jessica 15709 0.00787
## 9 2000 F Elizabeth 15094 0.00757
## 10 2000 F Taylor 15078 0.00756
## # ℹ 29,759 more rows
Here we see all the baby names in 2000!
In the code, I used %>% to direct my dataset
babynames into the next function, filter().
Inside the filter() function, I told the program I want
babies who were born in the year 2000. In other words, I kept only the
rows whose ‘year’ column says ‘2000’.
A good thing about %>% is that you can chain one pipe
after another. Just like you can direct water from east to west,
then from west to north.
Let’s try it out.
babynames %>%
filter(year == 2000) %>%
filter(sex == "F")
## # A tibble: 17,653 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2000 F Emily 25953 0.0130
## 2 2000 F Hannah 23080 0.0116
## 3 2000 F Madison 19967 0.0100
## 4 2000 F Ashley 17997 0.00902
## 5 2000 F Sarah 17697 0.00887
## 6 2000 F Alexis 17629 0.00884
## 7 2000 F Samantha 17266 0.00866
## 8 2000 F Jessica 15709 0.00787
## 9 2000 F Elizabeth 15094 0.00757
## 10 2000 F Taylor 15078 0.00756
## # ℹ 17,643 more rows
Here we put F in quotation marks because it is a character, not a number. We need to specify that so that R can understand us.
Now the data shows female babies born in the year 2000.
We can also acquire this output in another way.
babynames %>%
filter(year == 2000 & sex == "F")
## # A tibble: 17,653 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2000 F Emily 25953 0.0130
## 2 2000 F Hannah 23080 0.0116
## 3 2000 F Madison 19967 0.0100
## 4 2000 F Ashley 17997 0.00902
## 5 2000 F Sarah 17697 0.00887
## 6 2000 F Alexis 17629 0.00884
## 7 2000 F Samantha 17266 0.00866
## 8 2000 F Jessica 15709 0.00787
## 9 2000 F Elizabeth 15094 0.00757
## 10 2000 F Taylor 15078 0.00756
## # ℹ 17,643 more rows
Let us do more data transformation in the next chapter.
Here is a dataset of heroes. I downloaded it from https://github.com/dariomalchiodi/superhero-datascience/blob/master/content/data/heroes.csv?plain=1.
library(readr)
setwd ("~/Documents/MMR")
heroes <- read_delim("example files/heroes.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
View(heroes)
Note that the spreadsheet is a .csv file. Therefore,
when we import the data, we choose From Text (readr) instead
of From Excel.
However, we can see from the Preview that the
data is not ideal: each row comes in as a single column, with all the
information packed together.
Here, we click on Delimiter and choose
Semicolon, telling RStudio that in this dataset the
columns are separated by ; instead of a comma or a
tab.
We can see that the data contains loads of heroes. If I want
a neater view of just the data I need, I can use select()
to choose which columns to include for my further analysis. (From here
on I also switch to R’s native pipe |>, which works just like
%>%.)
heroes |>
select(c("Name", "Height", "Weight", "Gender", "Eye color", "Hair color", "Strength", "Intelligence"))
## # A tibble: 735 × 8
## Name Height Weight Gender `Eye color` `Hair color` Strength Intelligence
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 A-Bomb 203. 442. M Yellow No Hair 100 moderate
## 2 Abraxas NA NA M Blue Black 100 high
## 3 Abominat… 203. 442. M Green No Hair 80 good
## 4 Adam Mon… NA NA M Blue Blond 10 good
## 5 Agent 13 173. 61.0 F Blue Blond NA <NA>
## 6 Air-Walk… 189. 108. M Blue White 85 average
## 7 Agent Bob 178. 81.4 M Brown Brown 10 low
## 8 Abe Sapi… 191. 65.4 M Blue No Hair 30 high
## 9 Abin Sur 186. 90.9 M Blue No Hair 90 average
## 10 Angela NA NA F <NA> <NA> 100 high
## # ℹ 725 more rows
Now, if I want to choose heroes with a high intelligence level,
we can use filter().
This function keeps only the rows that match the condition.
heroes |>
filter(Intelligence == "high")
## # A tibble: 144 × 12
## Name Identity `Birth place` Publisher Height Weight Gender
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Abraxas Abraxas Within Eternity Marvel C… NA NA M
## 2 Abe Sapien Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 3 Angela <NA> <NA> Image Co… NA NA F
## 4 Yoda Yoda <NA> George L… 66.3 17.0 M
## 5 Zatanna Zatanna Zatara <NA> DC Comics 170. 57.8 F
## 6 Yellowjacket Hank Pym Elmsford, New York Marvel C… 183. 83.7 M
## 7 X-Man Nate Grey American Northeas… Marvel C… 176. 61.8 M
## 8 Wonder Woman Diana Prince Themyscira DC Comics 183. 74.7 F
## 9 Watcher Uatu <NA> Marvel C… NA NA M
## 10 Warlock Adam Warlock The Beehive, Shar… Marvel C… 188. 108. M
## # ℹ 134 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>
select() and filter() can be stacked
together.
Let us look for heroes who are female, have a strength of 50 or higher, and have a high intelligence level.
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50 & Intelligence == "high")
## # A tibble: 18 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Woman F 100 high
## 3 Valkyrie F 95 high
## 4 Supergirl F 100 high
## 5 Silk Spectre II F 55 high
## 6 She-Hulk F 100 high
## 7 Power Girl F 100 high
## 8 Phoenix F 100 high
## 9 Lady Bullseye F 75 high
## 10 Jean Grey F 80 high
## 11 Granny Goodness F 100 high
## 12 Giganta F 90 high
## 13 Faora F 95 high
## 14 Donna Troy F 95 high
## 15 Cheetah III F 100 high
## 16 Cat F 90 high
## 17 Captain Marvel F 90 high
## 18 Big Barda F 100 high
Next, suppose we want the same criteria, except that we accept both
good and high intelligence.
There are two ways of doing it. The first method is to use
| as “or”. The second method is to use
%in%, meaning that if the data fits either of the options,
it is kept.
# method 1
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50) |>
filter(Intelligence == "high" | Intelligence == "good")
## # A tibble: 47 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Girl F 90 good
## 3 Wonder Woman F 100 high
## 4 Valkyrie F 95 high
## 5 Vindicator F 65 good
## 6 Thor Girl F 85 good
## 7 T-X F 65 good
## 8 Supergirl F 100 high
## 9 Stargirl F 80 good
## 10 Spider-Gwen F 55 good
## # ℹ 37 more rows
# method 2
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50) |>
filter(Intelligence %in% c("high", "good"))
## # A tibble: 47 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Girl F 90 good
## 3 Wonder Woman F 100 high
## 4 Valkyrie F 95 high
## 5 Vindicator F 65 good
## 6 Thor Girl F 85 good
## 7 T-X F 65 good
## 8 Supergirl F 100 high
## 9 Stargirl F 80 good
## 10 Spider-Gwen F 55 good
## # ℹ 37 more rows
As we can see, both methods lead to the same results. I would say the first one is more straightforward; however, I feel cooler when using the second method. :D
If we want to know the heroes’ BMI (body mass index), we can compute it by dividing their weight (in kilograms) by the square of their height (in meters).
\[ BMI = \frac{weight\ (kg)}{height\ (m)^2} \]
According to this formula, we need to do the following things:
add a new column that converts height from centimeters to meters
calculate the square of the height in meters
divide weight by the square of the heights
heroes |>
mutate(Height_m = Height / 100) |> # convert height from cm to m
mutate(Height_m2 = Height_m*Height_m) |> # square of the height
mutate(BMI = Weight / Height_m2) # divide weight by the square of the height
## # A tibble: 735 × 15
## Name Identity `Birth place` Publisher Height Weight Gender
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 A-Bomb Richard Milhouse Jo… Scarsdale, A… Marvel C… 203. 442. M
## 2 Abraxas Abraxas Within Etern… Marvel C… NA NA M
## 3 Abomination Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 Adam Monroe <NA> <NA> NBC - He… NA NA M
## 5 Agent 13 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 Air-Walker Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 Agent Bob Bob <NA> Marvel C… 178. 81.4 M
## 8 Abe Sapien Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 Abin Sur <NA> Ungara DC Comics 186. 90.9 M
## 10 Angela <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 8 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>, Height_m <dbl>,
## # Height_m2 <dbl>, BMI <dbl>
We can always simplify (or complicate, depending on your perspective) things by mixing the steps into one line.
heroes |>
mutate(BMI = Weight / (Height/100)^2)
## # A tibble: 735 × 13
## Name Identity `Birth place` Publisher Height Weight Gender
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 A-Bomb Richard Milhouse Jo… Scarsdale, A… Marvel C… 203. 442. M
## 2 Abraxas Abraxas Within Etern… Marvel C… NA NA M
## 3 Abomination Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 Adam Monroe <NA> <NA> NBC - He… NA NA M
## 5 Agent 13 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 Air-Walker Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 Agent Bob Bob <NA> Marvel C… 178. 81.4 M
## 8 Abe Sapien Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 Abin Sur <NA> Ungara DC Comics 186. 90.9 M
## 10 Angela <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 6 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>, BMI <dbl>
The results will be the same with either one of the coding methods above. Choose what suits you the best.
Here is a fun tip: you can put the new column on the left, instead of on the right (the default). I like this because it makes checking the added column easier, especially when there are lots of columns. This way, I don’t have to scroll to the very end of the data to see my newly added column.
heroes |>
mutate(BMI = Weight / (Height/100)^2, .before = 1)
## # A tibble: 735 × 13
## BMI Name Identity `Birth place` Publisher Height Weight Gender
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 107. A-Bomb Richard Milho… Scarsdale, A… Marvel C… 203. 442. M
## 2 NA Abraxas Abraxas Within Etern… Marvel C… NA NA M
## 3 107. Abomination Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 NA Adam Monroe <NA> <NA> NBC - He… NA NA M
## 5 20.3 Agent 13 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 30.4 Air-Walker Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 25.6 Agent Bob Bob <NA> Marvel C… 178. 81.4 M
## 8 17.9 Abe Sapien Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 26.4 Abin Sur <NA> Ungara DC Comics 186. 90.9 M
## 10 NA Angela <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>
If I prefer Name before BMI, I can use
.after = "Name" (a column name) to tell R that I want my
new column placed to the right of the column I designate.
heroes |>
mutate(BMI = Weight / (Height/100)^2, .after = "Name")
## # A tibble: 735 × 13
## Name BMI Identity `Birth place` Publisher Height Weight Gender
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 A-Bomb 107. Richard Milho… Scarsdale, A… Marvel C… 203. 442. M
## 2 Abraxas NA Abraxas Within Etern… Marvel C… NA NA M
## 3 Abomination 107. Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 Adam Monroe NA <NA> <NA> NBC - He… NA NA M
## 5 Agent 13 20.3 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 Air-Walker 30.4 Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 Agent Bob 25.6 Bob <NA> Marvel C… 178. 81.4 M
## 8 Abe Sapien 17.9 Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 Abin Sur 26.4 <NA> Ungara DC Comics 186. 90.9 M
## 10 Angela NA <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>
One last tip before we move on: if I want to keep only the columns
used during the mutate() process, instead of using
select() and specifying the BMI, Height,
and Weight columns, I can use
.keep = "used" to tell R that I only want to keep
those columns.
heroes |>
mutate(BMI = Weight / (Height/100)^2, .keep = "used")
## # A tibble: 735 × 3
## Height Weight BMI
## <dbl> <dbl> <dbl>
## 1 203. 442. 107.
## 2 NA NA NA
## 3 203. 442. 107.
## 4 NA NA NA
## 5 173. 61.0 20.3
## 6 189. 108. 30.4
## 7 178. 81.4 25.6
## 8 191. 65.4 17.9
## 9 186. 90.9 26.4
## 10 NA NA NA
## # ℹ 725 more rows
Now, what I experienced in elementary school is that after the health examination each year, students would be categorized by their BMI. Then, homeroom teachers would tell the students categorized as “overweight”, “obese”, or “underweight” to be careful about their BMIs.
I am not going to discuss what kind of mental damage this caused me (as a child who often got the “overweight” label), but I will use it as an example of how to categorize numerical data into groups (categories) in R.
According to Wikipedia, BMI can be simply categorized as following:
| BMI | Category |
|---|---|
| < 18.5 | underweight (UW) |
| 18.5 - 24.9 (inclusive) | normal weight (NW) |
| 25.0 - 29.9 (inclusive) | overweight (OW) |
| >= 30 | obese (OB) |
What we are going to do now is categorize those heroes according to their BMIs.
We are still using mutate(), since we are adding a new
column containing their BMI status.
heroes |>
mutate(BMI = Weight / (Height/100)^2, .after = "Name") |>
mutate(BMI_status = case_when(BMI < 18.5 ~ "underweight",
BMI >= 18.5 & BMI <= 24.9 ~ "normal_weight",
BMI >= 25 & BMI <= 29.9 ~ "overweight",
BMI >= 30 ~ "obese"), .after = "BMI") -> heroes2
heroes2 |>
count(BMI_status)
## # A tibble: 5 × 2
## BMI_status n
## <chr> <int>
## 1 normal_weight 201
## 2 obese 127
## 3 overweight 114
## 4 underweight 41
## 5 <NA> 252
Here, we can see how many heroes are obese.
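As a side note, the case_when() boundaries above leave tiny gaps: a BMI of, say, 24.95 matches none of the four conditions and becomes NA. If that ever matters, here is a gap-free sketch using base R’s cut() (the labels are my own, matching the table above):

```r
# breaks define the intervals [-Inf,18.5), [18.5,25), [25,30), [30,Inf).
# right = FALSE makes each interval closed on the left, so every
# BMI value lands in exactly one category.
heroes |>
  mutate(BMI = Weight / (Height / 100)^2, .after = "Name") |>
  mutate(BMI_status = cut(BMI,
                          breaks = c(-Inf, 18.5, 25, 30, Inf),
                          labels = c("underweight", "normal_weight",
                                     "overweight", "obese"),
                          right = FALSE),
         .after = "BMI")
```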
We can also see that there is an <NA> row; this
happens when the data is incomplete. If we do not want rows with
incomplete information in our further analysis, we can remove them
by simply adding na.omit() to our code. (Note that
na.omit() drops rows with an NA in any
column, not just BMI_status, which is why the counts
shrink so much below.) This makes further analysis easier as well.
heroes2 |>
na.omit() |>
count(BMI_status)
## # A tibble: 4 × 2
## BMI_status n
## <chr> <int>
## 1 normal_weight 48
## 2 obese 40
## 3 overweight 40
## 4 underweight 13
We can also overwrite our heroes2 data
with the version without omissions.
heroes2 |>
na.omit() -> heroes2
If we want further categorization, for instance to know how many
female and male heroes are obese or underweight, we can achieve it by
adding another column to the count() function.
heroes2 |>
count(Gender, BMI_status)
## # A tibble: 8 × 3
## Gender BMI_status n
## <chr> <chr> <int>
## 1 F normal_weight 26
## 2 F obese 1
## 3 F overweight 1
## 4 F underweight 12
## 5 M normal_weight 22
## 6 M obese 39
## 7 M overweight 39
## 8 M underweight 1
We can also use the group_by() function, which will
be introduced in a future chapter, to achieve the same result.
heroes2 |>
group_by(Gender) |>
count(BMI_status)
## # A tibble: 8 × 3
## # Groups: Gender [2]
## Gender BMI_status n
## <chr> <chr> <int>
## 1 F normal_weight 26
## 2 F obese 1
## 3 F overweight 1
## 4 F underweight 12
## 5 M normal_weight 22
## 6 M obese 39
## 7 M overweight 39
## 8 M underweight 1
One last thing before moving on to the next chapter: we can
actually change the column names in our data without going back to the
.csv or .xlsx files!
Let’s say we want to change the column Hair color into
Hair_color, and Eye color into
Eye_color.
colnames(heroes2)[colnames(heroes2) == "Hair color"] <- "Hair_color"
colnames(heroes2)[colnames(heroes2) == "Eye color"] <- "Eye_color"
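As an alternative to the base R lines above (run one approach or the other, not both), dplyr’s rename() does the same job: new name on the left, old name on the right, with backticks because of the spaces.

```r
library(dplyr)

heroes2 <- heroes2 |>
  rename(Hair_color = `Hair color`,  # new = old
         Eye_color = `Eye color`)
```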
This part was a bit abstract when I first learnt it. However, it is very useful when we have wide data but need to do further analysis. R prefers data presented in a long format. For example:
| Name | Height (cm) | Age |
|---|---|---|
| Lily | 150 | 20 |
| Mary | 160 | 18 |
| John | 156 | 14 |
| Steve | 180 | 15 |
| Amy | 174 | 16 |
The above table can go on and on, by adding new names below. However, sometimes we will encounter data that looks like this:
| Name | Eng_C1 | Eng_C2 | Eng_C3 | Math_C1 | Math_C2 | Math_C3 |
|---|---|---|---|---|---|---|
| Kelly | 90 | 95 | 80 | 90 | 80 | 97 |
| Liam | 70 | 80 | 94 | 80 | 89 | 79 |
| Henry | 79 | 80 | 85 | 96 | 92 | 90 |
This is made-up data displaying different students’ exam performance. They took exams in both English and Math, covering Chapter 1 (C1) to Chapter 3 (C3).
In these cases, it is better to convert our wide data into long data. As we move along, we will start doing analyses (t-tests, ANOVAs, …), and it is easier if we can treat columns as variables. For example, in the table above, I would want the subject (English or Math) to be an independent variable, and the scores to be the dependent variable.
Let us create the data first.
performance_original <- data.frame(
Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
Eng_C1 = c(90, 70, 79, 98, 81, 79),
Eng_C2 = c(95, 80, 80, 93, 93, 70),
Eng_C3 = c(80, 94, 85, 89, 73, 91),
Math_C1 = c(90, 80, 96, 83, 92, 68),
Math_C2 = c(80, 89, 92, 72, 84, 83),
Math_C3 = c(97, 79, 90, 74, 70, 92))
We want to make the data look somewhat like this:
| Name | Subject | Score |
|---|---|---|
| Kelly | Eng_C1 | 90 |
| Kelly | Eng_C2 | 95 |
| Kelly | Eng_C3 | 80 |
| Kelly | Math_C1 | 90 |
| Kelly | Math_C2 | 80 |
| Kelly | Math_C3 | 97 |
| Liam | Eng_C1 | 70 |
| Liam | Eng_C2 | 80 |
| Liam | Eng_C3 | 94 |
| Liam | Math_C1 | 80 |
| Liam | Math_C2 | 89 |
| Liam | Math_C3 | 79 |
| Henry | Eng_C1 | 79 |
| Henry | Eng_C2 | 80 |
| … and so on |
Here, we will use pivot_longer() to help us.
performance_original |>
pivot_longer(!Name, names_to = "Subject", values_to = "Score")
## # A tibble: 36 × 3
## Name Subject Score
## <chr> <chr> <dbl>
## 1 Kelly Eng_C1 90
## 2 Kelly Eng_C2 95
## 3 Kelly Eng_C3 80
## 4 Kelly Math_C1 90
## 5 Kelly Math_C2 80
## 6 Kelly Math_C3 97
## 7 Liam Eng_C1 70
## 8 Liam Eng_C2 80
## 9 Liam Eng_C3 94
## 10 Liam Math_C1 80
## # ℹ 26 more rows
We can see that the data became the format we wanted. Let us see
what is inside the code chunk. !Name means “pivot every
column except Name”, since we still
want the dataset to contain students’ names.
Right after !Name, we see
names_to =. When we are compressing multiple columns into
one (like we are doing right now), the name of the new column holding
the old column names is specified by names_to =. On this basis, when we look at
values_to =, we can infer that it names the new column
holding the numerical values.
However, there are other ways to accomplish the same result. The following works well when there are not many columns, or when you would need to exclude more columns than just Name.
performance_original |>
pivot_longer(cols = c("Eng_C1", "Eng_C2", "Eng_C3", "Math_C1", "Math_C2", "Math_C3"),
names_to = "Subject", values_to = "Score") -> performance
Here, we use cols = to select the columns we want to
compress. In this example, though, it is easier to use
!Name than to use cols = and type out
all the columns that need compressing.
Now, bear with me as this gets even more abstract.
Suppose we want to further separate our data by chapter.
For instance, we want to split the Subject column
into Subject_em and Chapter. In this case, we
need to use the separate() function.
performance |>
separate(
col = Subject,
into = c("Subject_em", "Chapter"),
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE) -> performance
I know it looks scary, but please hear me out. This would be one of your best friends if you try to understand it. :)
As we mentioned, we want to further separate our Subject
column into Subject_em and Chapter. Hence, we
use col = to specify which column is our target. Then, we
use into = to tell RStudio that we want to separate the
column above to columns with these names.
Then we see a bizarre-looking bit of code:
sep = "[^[:alnum:]]+". We need to break it down to
understand it. In general, it means “one or more characters that are
neither letters nor digits”.
[:alnum:] means the letters A-Z
(a-z) and the digits 0-9.
^ inside the brackets means “not”.
+ means one or more.
Combining these segments: we want
one or more characters that are
neither letters nor digits. In this case, we just want to
find the character _, and let R know that this
_ is the separation point.
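Since the only separator in names like Eng_C1 is the underscore itself, a simpler equivalent call (just a sketch of the same separation) is to pass it literally:

```r
library(tidyr)

performance |>
  separate(
    col = Subject,
    into = c("Subject_em", "Chapter"),
    sep = "_")  # split exactly at the underscore
```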
Moving on, we see remove = TRUE and
convert = FALSE. Both of them are defaults in the
separate() function. We do not have to type them; however, I
want to put them here to remind myself.
remove = TRUE means to delete the original column,
Subject in this case.
convert = FALSE means that we do not want RStudio to
change the column types for us. Sometimes after the separation,
there are columns with numerical data (for instance, separating
Eng_1 into Eng and 1). If we
choose convert = TRUE, then that column is
automatically changed into numeric. However, we can change the column
type later on if we want to; we don't need RStudio to do that for us in
this step. Thus, we keep this argument FALSE.
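To see what convert = TRUE would do, here is a tiny sketch (with a hypothetical exam column, not our performance data):

```r
library(tidyr)

d <- data.frame(exam = c("Eng_1", "Eng_2"))

# with convert = TRUE, the chapter piece comes back as an integer column
# instead of a character column
separate(d, exam, into = c("subject", "chapter"), sep = "_", convert = TRUE)
```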
Congratulations! Now we have learned how to pivot our wide-format data into long format, and how to separate columns. (After 5 years of learning and using RStudio, I am still stunned by its power. EVERY SINGLE TIME.)
If there is a pivot_longer(), there must be a
pivot_wider(). As a matter of fact, there is! However, like I
said, RStudio prefers data in a long format for further analysis and
graphing; therefore, I will only mention pivot_longer()
here, as in my opinion it is more useful for data analysis and data
visualization. (Or I will add pivot_wider() when I am more
proficient with it, haha.)
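That said, here is a minimal sketch of pivot_wider() on a tiny toy data frame, so future me has something to start from:

```r
library(tidyr)

# toy long-format data
scores <- data.frame(
  Name    = c("Kelly", "Kelly", "Liam", "Liam"),
  Chapter = c("C1", "C2", "C1", "C2"),
  Score   = c(90, 95, 70, 80))

# names_from: the column whose values become the new column names
# values_from: the column that fills those new columns
pivot_wider(scores, names_from = Chapter, values_from = Score)
```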
This is another useful function when we have multiple datasets/files and we want to combine them into one dataset.
Here, we are going to continue using the performance
dataset. Let’s say that we also documented their Biology exams in
another file.
performance_biology <- data.frame(
Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
Bio_C1 = c(80, 80, 83, 93, 80, 88),
Bio_C2 = c(73, 76, 75, 77, 79, 83),
Bio_C3 = c(90, 93, 94, 90, 81, 82))
We pull out the original performance data and add these biology columns to it.
performance_original |>
left_join(performance_biology, by = "Name")
## Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3 Bio_C1 Bio_C2 Bio_C3
## 1 Kelly 90 95 80 90 80 97 80 73 90
## 2 Liam 70 80 94 80 89 79 80 76 93
## 3 Henry 79 80 85 96 92 90 83 75 94
## 4 Alice 98 93 89 83 72 74 93 77 90
## 5 Jake 81 93 73 92 84 70 80 79 81
## 6 Dan 79 70 91 68 83 92 88 83 82
Here, we can see that we have successfully added the biology columns
to the original dataset. Of course, when we need to join on multiple
columns, we can use
by = c("column1", "column2", …) to specify the columns the
join is based on.
Yes, you guessed right. There is also a right_join() in
RStudio. Although it may not be used as often as
left_join() (in my usual data analysis), I still sometimes
use it. So let us look at the differences between
left_join() and right_join().
left_join(): keeps the data on the left, even if there is no
matching data on the right.
right_join(): keeps the data on the right, even if there is
no matching data on the left.
If they encounter missing data, they will put an NA in
its place. It is as simple as that, but still extremely powerful (and
magical).
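A minimal sketch of that NA behavior, with made-up rows that only partly match:

```r
library(dplyr)

left_tbl  <- data.frame(Name = c("Kelly", "Liam"),  Eng = c(90, 70))
right_tbl <- data.frame(Name = c("Liam",  "Henry"), Bio = c(80, 83))

# keeps Kelly and Liam; Kelly has no Bio score, so Bio becomes NA
left_join(left_tbl, right_tbl, by = "Name")

# keeps Liam and Henry; Henry has no Eng score, so Eng becomes NA
right_join(left_tbl, right_tbl, by = "Name")
```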
Besides left and right joins, there is another way to combine different datasets into one.
dataset1 <- data.frame(
brand = c("1","2","3"),
price = c(200, 300, 250),
profit = c(100, 200, 200))
dataset2 <- data.frame(
brand = c("1","2","3"),
sales = c(20, 40, 60))
We have created two separate datasets. One contains a product’s brand, price, and profit. The other contains the sales numbers.
cbind(dataset1, dataset2) -> merged_data
merged_data
## brand price profit brand sales
## 1 1 200 100 1 20
## 2 2 300 200 2 40
## 3 3 250 200 3 60
Through cbind (column bind), we can merge two
datasets.
However, we can also see that it repeated the column
brand in the merged data. If this does not affect your
future analysis, then you could leave it. Or you could simply use
left_join() (as mentioned above) to do the binding without
creating the extra column.
dataset1 |>
left_join(dataset2, by = "brand")
## brand price profit sales
## 1 1 200 100 20
## 2 2 300 200 40
## 3 3 250 200 60
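One caveat worth remembering: cbind() pastes columns purely by position, so if the row orders of the two data frames differ, the rows silently stop lining up. left_join() matches rows by the key instead. A toy sketch:

```r
library(dplyr)

a <- data.frame(brand = c("1", "2"), price = c(200, 300))
b <- data.frame(brand = c("2", "1"), sales = c(40, 20))

# cbind: row 1 of a (brand 1) sits next to row 1 of b (brand 2) -- misaligned!
cbind(a, b)

# left_join: sales are matched to the correct brand
left_join(a, b, by = "brand")
```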
Let us look at the strength differences across publishers.
heroes2 |>
group_by(Publisher) |>
summarize(mean_strength = mean(Strength, na.rm = TRUE))
## # A tibble: 4 × 2
## Publisher mean_strength
## <chr> <dbl>
## 1 DC Comics 64.6
## 2 George Lucas 33.3
## 3 Image Comics 60
## 4 Marvel Comics 43.1
We can see from the result that the mean strengths of different publishers are different. However, is the difference significant?
Let us look at DC Comics vs Marvel Comics.
heroes2 |>
filter(Publisher %in% c("Marvel Comics", "DC Comics")) |>
group_by(Publisher) |>
summarize(Strengths = list(Strength)) |>
with(t.test(Strengths[[1]], Strengths[[2]], paired = FALSE))
##
## Welch Two Sample t-test
##
## data: Strengths[[1]] and Strengths[[2]]
## t = 2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.297946 36.652054
## sample estimates:
## mean of x mean of y
## 64.600 43.125
That looks complicated. I do not normally do it that way.
I first subset different groups of data.
subset(heroes2, Publisher == "Marvel Comics") -> marvel
subset(heroes2, Publisher == "DC Comics") -> DC
Then, I calculate the difference between these two groups of data
using t.test().
t.test(marvel$Strength, DC$Strength, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: marvel$Strength and DC$Strength
## t = -2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -36.652054 -6.297946
## sample estimates:
## mean of x mean of y
## 43.125 64.600
We can see from the result that there is a significant difference in strength level between heroes from DC Comics and heroes from Marvel Comics (p < .05).
Here, because the heroes from the DC universe are different
individuals from the heroes in the Marvel universe, we cannot use a
paired t-test.
However, when we measure the same group of people doing different
tasks (or doing the same task repeatedly), then we can use a
paired t-test.
I hope you still remember the performance data. Let us say we want to look at whether there is any difference between the exam scores of Chapter 1 and Chapter 3.
# group them first
subset(performance, Subject_em == "Eng" & Chapter == "C1") -> eng_c1
subset(performance, Subject_em == "Eng" & Chapter == "C3") -> eng_c3
t.test(eng_c1$Score, eng_c3$Score, paired = TRUE)
##
## Paired t-test
##
## data: eng_c1$Score and eng_c3$Score
## t = -0.44114, df = 5, p-value = 0.6775
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -17.06789 12.06789
## sample estimates:
## mean difference
## -2.5
Here is a template for anova_test().
library(rstatix)
heroes2 |>
anova_test(
dv = # dependent variable,
wid = # subject name/id,
type = NULL, # default: type II ANOVA
between = # between-subject independent variable,
within = # within-subject independent variable,
effect.size = "ges",
error = NULL,
white.adjust = FALSE,
observed = NULL,
detailed = TRUE
)
detailed = F is the default. However, I would like
my results to be a bit more informative. Thus, I normally change it to
detailed = T.
type = NULL is the default; it runs a type II ANOVA.
When you need to change it, just switch NULL to
1 or 3.
You can add covariate = below within =
and above type =, for ANCOVA.
effect.size = is the effect size to compute:
pes (partial eta squared)
ges (generalized eta squared; the default)
The Help section in RStudio says I could choose
both, but I haven’t figured out how to do so yet.
IV: intelligence
DV: strength
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = Intelligence,
detailed = T
)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 Intelligence 8758.544 129262 4 136 2.304 0.062 0.063
IV: gender
DV: strength
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = Gender,
detailed = T
)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 Gender 3772.182 134248.4 1 139 3.906 0.05 0.027
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, BMI_status),
detailed = TRUE
)
## ANOVA Table (type III tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 (Intercept) 68473.044 109803.3 1 133 82.938 1.11e-15 * 0.384000
## 2 Gender 51.278 109803.3 1 133 0.062 8.04e-01 0.000467
## 3 BMI_status 8413.707 109803.3 3 133 3.397 2.00e-02 * 0.071000
## 4 Gender:BMI_status 6885.839 109803.3 3 133 2.780 4.40e-02 * 0.059000
Be careful about the ANOVA type. Even though I did not specify which
type of ANOVA I wanted, the system chose type III for me. Add
type = 2 if you would like to stick with a
type II ANOVA.
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, BMI),
detailed = T
)
However, we would encounter an error here. Why is that?
Step 1
We examine whether the two independent variables are of different
types, for example, a categorical and a numerical
variable. This error happens if we put Gender (categorical)
and BMI (numerical) in the between =
section.
Solutions:
1. Change the numerical column into a categorical column. For
example, categorize BMI into different levels like obese, overweight,
normal, and underweight. (We did this in a previous section, so I am
just going to put the end code here.)
2. Use ANCOVA (continuous *
category ~ continuous) instead of ANOVA. Simply put the numerical
variable in the covariate = section. I will not go further
into this topic here, as we are in the ANOVA section, and as I
am not familiar with ANCOVA. Will come back when I use it more
often.
# solution 1
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, BMI_status),
detailed = T
)
## ANOVA Table (type III tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 (Intercept) 68473.044 109803.3 1 133 82.938 1.11e-15 * 0.384000
## 2 Gender 51.278 109803.3 1 133 0.062 8.04e-01 0.000467
## 3 BMI_status 8413.707 109803.3 3 133 3.397 2.00e-02 * 0.071000
## 4 Gender:BMI_status 6885.839 109803.3 3 133 2.780 4.40e-02 * 0.059000
# solution 2 (ANCOVA)
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = Gender,
covariate = BMI,
detailed = T
)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 BMI 23336.275 110912.1 1 138 29.036 2.99e-07 * 1.74e-01
## 2 Gender 10.606 110912.1 1 138 0.013 9.09e-01 9.56e-05
Let us take a look at another example.
We know that in the heroes2 data,
Intelligence is a categorical variable.
Heroes’ intelligence levels are labelled as high,
good, average, moderate, and
low.
heroes2 |>
distinct(Intelligence)
## # A tibble: 5 × 1
## Intelligence
## <chr>
## 1 moderate
## 2 good
## 3 high
## 4 average
## 5 low
Therefore, we should not encounter the same problem as in example 2.
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, Intelligence),
detailed = T)
However, we still see errors here. If we have finished examining that Step 1 is not the issue, we can try to debug in the following ways.
Step 2
We examine both independent variables separately to make sure each works fine individually. Just delete one of the variables at a time to check. (Usually this is not the problem, but just in case, I would still check it.)
Gender
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = Gender,
detailed = T)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 Gender 3772.182 134248.4 1 139 3.906 0.05 0.027
Intelligence
heroes2 |>
anova_test(
dv = Strength,
wid = Name,
between = Intelligence,
detailed = T)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 Intelligence 8758.544 129262 4 136 2.304 0.062 0.063
Both variables look fine, and the code works smoothly individually.
Step 3
We examine whether there are any missing values in any of the categories. (I often forget this step, and have been confused for hours about why my ANOVA did not work…)
table(heroes2$Gender, heroes2$Intelligence)
##
## average good high low moderate
## F 14 20 4 0 2
## M 24 41 27 2 7
Here, we can see that for female heroes with a low intelligence level,
there is a zero. This makes the ANOVA unable to proceed. Now that we have
found the problem, we can simply remove Intelligence == "low"
from our analysis, as it hinders it.
heroes2 |>
filter(Intelligence != "low") |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, Intelligence),
type = 2,
detailed = T)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p p<.05 ges
## 1 Gender 1982.076 126823.2 1 131 2.047 0.155 0.015
## 2 Intelligence 3719.933 126823.2 3 131 1.281 0.284 0.028
## 3 Gender:Intelligence 406.757 126823.2 3 131 0.140 0.936 0.003
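For future debugging, a quick way to flag empty cells programmatically (a toy cross-table here, not heroes2):

```r
# a toy cross-table with one empty combination
tab <- table(gender = c("F", "F", "M", "M", "M"),
             intel  = c("good", "good", "good", "low", "low"))

# rows/columns of the cells that contain zero observations
which(tab == 0, arr.ind = TRUE)
```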
Maybe you would think: I’ll just add group_by() before
the anova_test(). However, while this would work, it would
not be the two-way ANOVA we wanted. Instead, it would
just separate the data into two groups before performing a one-way
ANOVA on each. See the code below:
heroes2 |>
filter(Intelligence != "low") |>
group_by(Gender) |>
anova_test(
dv = Strength,
wid = Name,
between = Intelligence,
detailed = T
)
## # A tibble: 2 × 10
## Gender Effect SSn SSd DFn DFd F p `p<.05` ges
## * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 F Intelligence 1564. 33535. 3 36 0.56 0.645 "" 0.045
## 2 M Intelligence 2563. 93288. 3 95 0.87 0.46 "" 0.027
We can see that the result table is different from the two-way ANOVA
table (refer to the table above), and the numbers are different too.
(See DFn and DFd in both results.)
If we want to add more independent variables, we could just add more
column names in the between = section.
heroes2 |>
filter(Intelligence != "low") |>
anova_test(
dv = Strength,
wid = Name,
between = c(Gender, Intelligence, BMI_status),
type = 2,
detailed = T
)
## ANOVA Table (type II tests)
##
## Effect SSn SSd DFn DFd F p
## 1 Gender 1083.194 96626.99 1 118 1.323 0.252000
## 2 Intelligence 3465.586 96626.99 3 118 1.411 0.243000
## 3 BMI_status 16524.941 96626.99 3 118 6.727 0.000314
## 4 Gender:Intelligence 2703.193 96626.99 2 118 1.651 0.196000
## 5 Gender:BMI_status 8295.111 96626.99 3 118 3.377 0.021000
## 6 Intelligence:BMI_status 7715.048 96626.99 7 118 1.346 0.235000
## 7 Gender:Intelligence:BMI_status NA 96626.99 0 118 NA NA
## p<.05 ges
## 1 0.011
## 2 0.035
## 3 * 0.146
## 4 0.027
## 5 * 0.079
## 6 0.074
## 7 <NA> NA
Note that if an independent variable is within-subject (i.e., this is
a mixed analysis), then we put the within-subject independent variables
in the within = section.
This is often used when analyzing experiment results, for example
comparing test scores before learning and after learning. See below
to learn more about mixed ANOVA.
As mentioned, mixed ANOVA is more often used in analyzing experiment results, as it would be difficult to capture individual changes over time with other methods.
Let us assume a high school wants to know whether different teaching styles affect students’ exam scores. They performed an experiment on students, separating them into two groups: one group received education in a fun way, the other in a rigid way. The school also gave exams before the students learned the subject and after learning.
Independent variable: teaching style (fun/rigid), exam time (before/after)
Dependent variable: exam score
teaching_style <- data.frame(
student = c("1", "2", "3", "4", "5", "6",
"1", "2", "3", "4", "5", "6"),
style = as.factor(c("fun", "rigid", "fun", "rigid", "fun", "rigid",
"fun", "rigid", "fun", "rigid", "fun", "rigid")),
timing = as.factor(c("before", "before", "before", "before", "before", "before",
"after", "after", "after", "after", "after", "after")),
score = c(50, 13, 10, 70, 23, 70, 70, 90, 66, 83, 80, 80)
)
The data frame is created. There are 6 students in total. 3 attended
the fun style teaching, 3 experienced the
rigid style. All of their before and
after exam scores are also presented.
Here, we want to see whether the teaching style made a difference in their scores.
teaching_style |>
anova_test(
dv = score,
wid = student,
between = style,
within = timing,
detailed = T)
## ANOVA Table (type II tests)
##
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 (Intercept) 1 4 41418.750 1278.667 129.569 0.00034 * 0.929
## 2 style 1 4 954.083 1278.667 2.985 0.15900 0.232
## 3 timing 1 4 4524.083 1876.667 9.643 0.03600 * 0.589
## 4 style:timing 1 4 90.750 1876.667 0.193 0.68300 0.028
From the analysis, we can see that the style does not have a
significant impact on their scores. However, we can see that there is a
main effect of the timing variable.
By looking at the mean scores at different timings, we can see that the students performed better after learning, no matter which learning style was implemented.
teaching_style |>
group_by(timing) |>
summarise(mean_score = mean(score))
## # A tibble: 2 × 2
## timing mean_score
## <fct> <dbl>
## 1 after 78.2
## 2 before 39.3
Additional notes
Use filter() to choose the specific groups you want in
your analysis. For instance,
filter(Intelligence %in% c("high", "good")) or
filter(Hair_color != "No Hair").
group_by() can still come in handy when you want to
see both conditions’ ANOVA results.
After ANOVA, if we want to dig deeper into the data (and if the ANOVA has a significant result), we can use post hoc analysis.
I used to get confused about post hoc analysis versus pairwise t-tests. Note that post hoc analysis comes after we retrieve a significant result from an ANOVA, while a pairwise t-test can be applied to any two groups of data.
teaching_style |>
group_by(style) |>
pairwise_t_test(
score ~ timing,
paired = TRUE,
p.adjust.method = "bonferroni"
)
## # A tibble: 2 × 11
## style .y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
## * <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 fun score after before 3 3 3.64 2 0.068 0.068 ns
## 2 rigid score after before 3 3 1.53 2 0.267 0.267 ns
Here, we use post hoc analysis.
score ~ timing: we put the dependent
variable on the left and the independent variable
on the right. Here, we compare the scores under different
exam timings.
paired =: if it is a repeated measure, then
TRUE; if not (implemented on different participants, aka
between-subject), then FALSE.
p.adjust.method: allows us to choose the method for
adjusting p-values.
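To get a feel for what the adjustment does, here is base R's p.adjust() on a made-up vector of raw p-values:

```r
p_raw <- c(0.01, 0.04, 0.30)

# Bonferroni multiplies each p-value by the number of tests (capped at 1),
# giving roughly 0.03, 0.12, 0.90 here
p.adjust(p_raw, method = "bonferroni")
```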
We took the heroes dataset as an example again.
heroes2 |>
ggplot(aes(x = Gender, y = Strength)) +
geom_bar(stat = "summary", position = position_dodge(.8), width = .7, fill = "#BDD5EA") +
geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
facet_wrap(Intelligence ~ ., scales = "free")
Looks like a lot happened in that code chunk, and there is a lot of information in the graphs.
Let us start with the code.
in ggplot(), we specify the x axis and
y axis.
geom_bar() creates a bar chart.
stat = "summary" helps you calculate the mean value
of the y-axis, Strength in this case.
position = position_dodge() places bars that share the same x position side by side instead of overlapping them.
width = is the width of the bar. When the
width = 1, the bars would stick together.
fill = can either be a single color (like in this
example), or could be other variables.
geom_errorbar() helps calculate the standard error
of the value, then presents it on the graph.
width, fill, and position
serve the same functions as in geom_bar().
facet_wrap() helps you separate graphs according
to the variable you put in.
You can either do <your variable> ~ . or
~ <your variable>.
The scales = "free" argument changes the y-axis of each
graph according to its maximum values. It is, however, not recommended
in some situations; we will explain below.
Then, let’s take a look at the graph. We can see a few things:
The y-axis scale differs between the good intelligence
group and the high intelligence group; we will explain how
to adjust this situation in the scales section.
If we do not want to use facet_wrap() on the graph, we
can also use other ways to distinguish the different categories. For
example, if we put a variable in fill =, then we can ask
RStudio to fill a different color for each category.
heroes2 |>
ggplot(aes(x = Gender, y = Strength, fill = Intelligence)) +
geom_bar(stat = "summary", position = position_dodge(.8), width = .7) +
geom_errorbar(stat = "summary", position = position_dodge(.8), width = .12)
heroes2 |>
ggplot(aes(x = Height, y = Strength, color = Gender, shape = Gender)) +
geom_point() +
geom_smooth(method = "lm")
We can see from the plot that this is probably not the best data to
demonstrate a scatter plot, as the outliers are obviously affecting the
aesthetics. Let us try to remove that male hero on the very right side
of the graph. Since we are removing a hero (a row), we need
filter() to help us out.
heroes2 |>
filter(Height < 250) |>
ggplot(aes(x = Height, y = Strength, color = Gender, shape = Gender)) +
geom_point() +
geom_smooth(method = "lm")
Now that we have a clearer scatter plot, let us take a look inside the code.
In the aes() section, we can see that I mapped both the
color AND the shape to Gender. Usually one kind of
specification would work; however, sometimes the colors might not be
easily distinguishable. Therefore, it is useful to add another feature in
this category to make your graph more readable.
geom_point() is what we use to represent each data
point as a point on the graph, thus creating the scatter plot.
geom_smooth() adds trend lines to the data.
Here, method = "lm" means that we use a linear
model, which keeps the line straight.
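The straight line that geom_smooth(method = "lm") draws is the same fit you would get from base R's lm(); a sketch with made-up heights and strengths:

```r
toy <- data.frame(Height   = c(160, 170, 180, 190),
                  Strength = c(40, 50, 55, 70))

# fit a linear model: Strength as a function of Height
fit <- lm(Strength ~ Height, data = toy)
coef(fit)  # intercept and slope of the trend line
```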
I am a bit tired of the data we have been using so far, so let us find a new fun dataset.
This dataset is inside an R package called openintro. It
contains lots of datasets. We are using one of them called
movies.
install.packages("openintro")
library("openintro")
movies
## # A tibble: 140 × 5
## movie genre score rating box_office
## <chr> <chr> <dbl> <chr> <dbl>
## 1 2Fast2Furious action 48.9 PG-13 127.
## 2 28DaysLater horror 78.2 R 45.1
## 3 AGuyThing rom-comedy 39.5 PG-13 15.5
## 4 AManApart action 42.9 R 26.2
## 5 AMightyWind comedy 79.9 PG-13 17.8
## 6 AgentCodyBanks action 57.9 PG 47.8
## 7 Alex&Emma rom-comedy 35.1 PG-13 14.2
## 8 AmericanWedding comedy 50.7 R 104.
## 9 AngerManagement comedy 62.6 PG-13 134.
## 10 AnythingElse rom-comedy 63.3 R 3.21
## # ℹ 130 more rows
In this dataset, there are 5 columns and 140 rows. They are movies
released in 2003. The data contains the movie title, genre, score (by
critics, on a scale of 0-100), MPAA rating, and box_office
(millions of dollars earned at the box office in the US and Canada).
You can search for movies in the Help section in
the bottom-right panel to see a more detailed description of the
data.
movies |>
ggplot(aes(x = box_office, y = score)) + geom_point() +
geom_smooth(method = "lm")
We can see from the graph that there seems to be a positive
relationship between box_office and
score.
A boxplot helps us see the interquartile range in our data. This is useful when you are comparing a categorical variable against numerical values.
movies |>
ggplot(aes(x = score, y = rating)) +
geom_boxplot()
movies |>
ggplot(aes(x = score)) +
geom_histogram()
movies |>
ggplot(aes(x = score)) +
geom_histogram(binwidth = 1)
movies |>
ggplot(aes(x = score)) +
geom_histogram(binwidth = 10)
binwidth controls the thickness of the columns. If it is
very thin (binwidth = 1), you will see each data point
clearly; however, this is not the best choice when we want to see a
general trend in the data. Therefore, I recommend trying out
different binwidth numbers until you find the one that
suits your data the most.
movies |>
ggplot(aes(x = score, color = rating)) +
geom_histogram(binwidth = 5)
We can see that color controls the
outline color of the bars, while fill
controls the bars themselves.
movies |>
ggplot(aes(x = score, fill = rating)) +
geom_histogram(binwidth = 5)
A density plot looks like what we would get if we connected the peaks of a histogram, creating a smoother line of the data.
movies |>
ggplot(aes(x = score)) +
geom_density()
We can also separate data into sub categories.
movies |>
ggplot(aes(x = score, fill = rating, color = rating)) +
geom_density()
However, this graph is a bit difficult to read. One reason is that
the data overlap. We are unable to see the data underneath the ratings
R and PG-13 clearly. Therefore, we are going
to change the transparency of our graph.
movies |>
ggplot(aes(x = score, fill = rating, color = rating)) +
geom_density(alpha = 0.5)
We can also separate them further apart by using
geom_density_ridges().
install.packages("ggridges")
library(ggridges)
movies |>
ggplot(aes(x = score,y = rating, fill = rating, color = rating)) +
geom_density_ridges(alpha = 0.6)
Next, we are going to create pair plots.
We will be using another package: GGally.
install.packages("GGally")
library(GGally)
Let us say we want to see if there is any relationship between the
lego price and its size.
lego_sample |>
ggpairs(columns = c("price", "size"))
Let us take a deeper look at the code and the results.
The price in the dataset is a continuous numeric
variable, while the size is a categorical
variable.
In the result, we can see that the top-left panel is the distribution
of price. It is a univariate plot, as it is
price vs price as read from the columns and
rows. The same applies to the bottom-right panel: we know that
size is a categorical variable, so RStudio counts
the number of lego sets of each size for us.
The top-right panel is price vs size. When it
comes to a numeric variable vs a categorical
variable, a boxplot can be informative; thus, RStudio drew that
plot for us. The bottom-left panel conveys the same idea as the
top-right, as they are both price vs size. It
separates the different sizes (large and small)
and plots the distribution of prices. Sadly, the graph does not
tell us which line belongs to size large and which one
to size small.
Let us take a look at other variables.
In the dataset, there is a column called price, noting
the recommended retail price of lego. Also, there is a special column
called amazon_price. You can guess from the name that this
is the lego price on amazon.
lego_sample |>
ggpairs(columns = c("price", "amazon_price"))
This graph is different from our examples above. The reason is
that the two variables here (price and
amazon_price) are both numerical.
Therefore, the top-left panel contains the density plot of the
recommended retail price, while the bottom-right panel
contains that of amazon_price. We can see that the distributions
are roughly the same, although not identical.
On the bottom-left panel, there is a scatter plot of the data. We can see a relatively strong positive correlation between the two variables, which is confirmed by the top-right panel, where the correlation coefficient is shown.
Let us see what would happen if we put only numeric variables inside.
Earlier, we made the variable minifigures a
factor instead of its original
numeric type. However, we need numeric variables right
now, so we do not apply the as.factor() function
here.
lego_sample |>
ggpairs(columns = c("price", "amazon_price", "minifigures"))
We can see this is a good way to grasp a general idea of the relationships in our data.
Sometimes we want some extra information on the graph; for example, a trend line on the scatter plot, with the error area shown around it.
lego_sample |>
ggpairs(columns = c("price", "amazon_price", "minifigures"),
lower = list(continuous = wrap("smooth")))
Here, we simply add lower = and specify what we
want.
lower = means the lower part from the diagonal
line.
continuous =: we are telling RStudio that our data is a
continuous numerical variable; we put wrap() to specify what
we want to do with the data.
smooth: means that we put geom_smooth
on the panels below the diagonal line.
Note that the correlation method used here is the default
"pearson". If you would like to change the method, you
can add some extra code here.
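The same choice of method exists in base R's cor(), which is an easy place to try it out (toy vectors here):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cor(x, y, method = "pearson")   # the default
cor(x, y, method = "spearman")  # rank-based alternative
```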
lego_sample |>
ggpairs(columns = c("price", "amazon_price", "minifigures"),
upper = list(continuous = wrap("cor", method = "spearman")),
lower = list(continuous = wrap("smooth")))
We now see an upper = code here. It is a bit different
from the lower = part.
cor: means that we want our upper panels to show the
correlation coefficient. Here, if we do not want the default
pearson method, we can use method = to specify
another correlation method.
Now you have learned a few ways to deal with plots inside
ggpairs(). Let us add some colors to it (literally).
If we want to put a categorical variable here, a good way is to
separate different groups by colors. We can specify that on our
aes() layer.
lego_sample |>
ggpairs(columns = c("price", "amazon_price", "minifigures", "size"),
mapping = aes(color = size, alpha = .07),
upper = list(continuous = wrap("cor", method = "spearman")),
lower = list(continuous = wrap("smooth")))
Here, as you can see, I also added transparency with alpha =,
as I don’t want the plot to be too difficult to read because of the
overlapping graphs.
Below is another coding style to achieve the same result. I personally prefer this one because it separates the steps of selecting columns I want before moving on to the plotting section.
lego_sample |>
select(price, amazon_price, minifigures, size) |>
ggpairs(mapping = aes(color = size, alpha = .07),
upper = list(continuous = wrap("cor", method = "spearman")),
lower = list(continuous = wrap("smooth")))
Although ggpairs() comes in handy when you want to look
at relationships between different variables (i.e., you want to see
whether scores from different tests are related), we can also get dizzy
from all the graphs. If you are unsure whether this is the most helpful
way to present your data, I would recommend doing plots separately
first (a density plot or a box plot) before you mix everything
together. It would help you figure out what you want to do with your
data, and how to do it in the best way possible.
A good graph contains informative labels. Of course, you could add the title, subtitle, and other labels after you have downloaded the graph, in other software. However, it is more efficient and causes fewer mistakes when we finish all the labelling at once.
movies |>
ggplot(aes(x = score,y = rating, fill = rating, color = rating)) +
geom_density_ridges(alpha = 0.6) +
labs(
title = "Movie Scores of Different MPAA Ratings.",
subtitle = "Put your subtitle here because I don't know what to say.",
x = "Movie Score",
y = "MPAA Rating",
fill = "Rating",
color = "Rating",
caption = "sources: movies from openintro")
If we do not want to present a label, we can also specify that and let the program delete it for us.
movies |>
ggplot(aes(x = score,y = rating, fill = rating, color = rating)) +
geom_density_ridges(alpha = 0.6) +
labs(
x = NULL,
fill = NULL,
color = NULL)
We can see from the code chunk above that when we put NULL
for the x-axis, the label does not appear in the graph.
However, because we did not put the y-axis in
labs(), the y-axis still presents the label from your
original data.
Also, if you would like to remove the legend as well, you could do
that in theme().
movies |>
ggplot(aes(x = score,y = rating, fill = rating, color = rating)) +
geom_density_ridges(alpha = 0.6) +
labs(x = NULL, y = NULL) +
theme(legend.position = "none")
Let us demonstrate how to change colors in graph with another dataset!
In this data lego_sample, there are different lego sets,
pieces, recommended retail price, prices on Amazon, sizes and so on.
lego_sample
## # A tibble: 75 × 14
## item_number set_name theme pieces price amazon_price year ages pages
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 10859 My First Lady… DUPL… 6 4.99 16 2018 Ages… 9
## 2 10860 My First Race… DUPL… 6 4.99 9.45 2018 Ages… 9
## 3 10862 My First Cele… DUPL… 41 15.0 39.9 2018 Ages… 9
## 4 10864 Large Playgro… DUPL… 71 50.0 56.7 2018 Ages… 32
## 5 10867 Farmers' Mark… DUPL… 26 20.0 37.0 2018 Ages… 9
## 6 10870 Farm Animals DUPL… 16 9.99 9.99 2018 Ages… 8
## 7 10872 Train Bridge … DUPL… 26 25.0 22.0 2018 Ages… 16
## 8 10875 Cargo Train DUPL… 105 120. 129. 2018 Ages… 64
## 9 10876 Spider-Man & … DUPL… 38 30.0 74.5 2018 Ages… 20
## 10 10878 Rapunzel's To… DUPL… 37 30.0 99.0 2018 Ages… 24
## # ℹ 65 more rows
## # ℹ 5 more variables: minifigures <dbl>, packaging <chr>, weight <chr>,
## # unique_pieces <dbl>, size <chr>
lego_sample |>
na.omit() |>
ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
geom_bar(stat = "summary", position = "dodge")
For easier demonstration, we are going to look only at lego sets with 1 to 3 minifigures.
lego_sample |>
na.omit() |>
filter(minifigures <= 3) |>
ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
geom_bar(stat = "summary", position = "dodge")
Now we are all set!
lego_sample |>
na.omit() |>
filter(minifigures <= 3) |>
ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
geom_bar(stat = "summary", position = "dodge") +
scale_fill_manual(values = c("darkblue","yellow","pink"))
As simple as that! Also note that you only need to type each color
once. For example, although "darkblue",
"yellow", and "pink" each appear twice in our
graph, we only need to type them once instead of
values = c("darkblue", "yellow", "pink", "darkblue", "yellow", "pink").
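A side note (my addition, with toy data standing in for the LEGO sets): scale_fill_manual() also accepts a named vector, which pins each color to a specific factor level, so the mapping stays correct even if the levels get reordered or filtered later.

```r
library(ggplot2)

# toy data standing in for the minifigure counts
df <- data.frame(minifigs = factor(c(1, 2, 3, 1, 2, 3)),
                 price = c(10, 20, 15, 12, 22, 18))

ggplot(df, aes(x = minifigs, y = price, fill = minifigs)) +
  geom_col() +
  # named values: each level always gets its own color
  scale_fill_manual(values = c("1" = "darkblue", "2" = "yellow", "3" = "pink"))
```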
If you do not want the default colors, but are too lazy to pick distinct colors every time you draw, you can use a color palette package such as viridis.
install.packages("viridis")
library(viridis)
You can also add outlines around the bars to make them pop more.
lego_sample |>
na.omit() |>
filter(minifigures <= 3) |>
ggplot(aes(x = size, y = price, fill = as.factor(minifigures))) +
geom_bar(stat = "summary", position = "dodge", color = "black") +
scale_fill_viridis_d()
Note that we use scale_fill_viridis_d() here. When you
are using color = instead of fill =, remember
to switch to scale_color_viridis_d().
lego_sample |>
na.omit() |>
filter(minifigures <= 3) |>
ggplot(aes(x = size, y = price, color = as.factor(minifigures))) +
geom_bar(stat = "summary", position = "dodge", fill = "white") +
scale_color_viridis_d()
Let us look at some data from the heroes2 dataset.
heroes2 |>
filter(Intelligence %in% c("good", "high")) |>
ggplot(aes(x = Gender, y = Strength)) +
geom_bar(stat = "summary", position = position_dodge(.8), width = .7) +
geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
facet_wrap(Intelligence ~ ., scales = "free")
Let us take a quick glance at the chart. It looks like the good
intelligence group has higher strength than the high
intelligence group, since its bars are taller.
However, is that really the case?
Take a closer look at the two bar charts: the y-axis scale differs from one panel to the other.
To be sure, let us calculate the average strength of the two groups.
heroes2 |>
filter(Intelligence %in% c("good", "high")) |>
group_by(Intelligence, Gender) |>
summarise(mean_strength = mean(Strength))
## # A tibble: 4 × 3
## # Groups: Intelligence [2]
## Intelligence Gender mean_strength
## <chr> <chr> <dbl>
## 1 good F 34
## 2 good M 45.4
## 3 high F 55
## 4 high M 57.0
We can see from the result that in the good intelligence
group, the mean strength of both genders is lower than in the
high intelligence group. However, if we did
not pay attention to the y-axis, the graph might give us the wrong
impression that the good intelligence group has higher
strength than the high intelligence group.
If we want to resolve this issue, we can set the scale manually
with coord_cartesian(). We can either remove
scales = "free" or leave it in place: since ggplot2
builds the plot layer by layer, adding the manual limits after
scales = "free" puts every panel back on the same y-axis range.
heroes2 |>
filter(Intelligence %in% c("good", "high")) |>
ggplot(aes(x = Gender, y = Strength)) +
geom_bar(stat = "summary", position = position_dodge(.8), width = .7, fill = "#BDD5EA") +
geom_errorbar(stat = "summary", position = position_dodge(0.5), width = .12) +
facet_wrap(Intelligence ~ ., scales = "free") +
coord_cartesian(ylim = c(0,100))
After putting each panel on the same scale, it is a lot easier to compare them.
You can adjust the ylim values inside coord_cartesian() to fit your
data.
Theme functions change the overall look of the
graph. I commonly use theme_classic(),
theme_light(), or the default
theme_gray().
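These are "complete themes" that restyle the whole plot at once. Here is a small self-contained sketch (toy data, not from the post) you can run to compare them by swapping the last line.

```r
library(ggplot2)

# toy data, just to have something to style
df <- data.frame(x = c("A", "B", "C"), y = c(2, 4, 3))

ggplot(df, aes(x = x, y = y)) +
  geom_col() +
  theme_classic()   # try theme_light() or theme_gray() here instead
```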
library(ggplot2)
library(dplyr)
# Lithuanian flag: three horizontal stripes (yellow on top, then green, then red)
lith <- data.frame(name = c("A", "B", "C"),
                   value = c(4, 4, 4))
lith %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "summary", width = 1, fill = c("#BE3A34", "#046A38", "#FFB81C")) +
  coord_flip() +
  theme_void()
# Latvian flag: carmine, white, carmine horizontal stripes
latv <- data.frame(name = c("A", "B", "C"),
                   value = c(4, 4, 4))
latv %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = c(1.2, 0.9, 1.2), fill = c("#A4343A", "#FFFFFF", "#A4343A")) +
  coord_flip() +
  theme_void()
I am still struggling a bit with the width part. The flag looks okay, but I don't think this is the correct way to do it.
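One possible alternative to fiddling with bar widths (my own sketch, not from the post) is to draw each stripe as a rectangle with geom_rect(), so the stripe heights become exact proportions. The real Latvian flag uses roughly a 2:1:2 ratio, which is an assumption I am bringing in here.

```r
library(ggplot2)

# each row is one stripe: carmine (height 2), white (1), carmine (2)
latv2 <- data.frame(ymin = c(0, 2, 3),
                    ymax = c(2, 3, 5),
                    col  = c("#A4343A", "#FFFFFF", "#A4343A"))

ggplot(latv2) +
  geom_rect(aes(xmin = 0, xmax = 10, ymin = ymin, ymax = ymax),
            fill = latv2$col) +
  coord_fixed() +
  theme_void()
```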
Now let’s try one with vertical stripes.
# Italian flag: three vertical stripes (green, white, red)
ital <- data.frame(name = c("A", "B", "C"),
                   value = 4)
ital %>%
  ggplot(aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = 1, fill = c("#008C45", "#F4F9FF", "#CD212A")) +
  theme_void()
Still struggling with the Icelandic flag. Stay tuned.
# 9 bars, each split into stacked blue-white-red-white-blue segments
specie <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
condition <- c("a", "b", "c", "d", "d", "c", "d", "c", "d",
               "e", "a", "b", "c", "d", "c", "d", "b", "c",
               "d", "e", "a", "b", "d", "c", "d", "a", "b",
               "c", "d", "e", "a", "d", "c", "d", "e", "a",
               "b", "c", "d", "e", "d", "c", "d", "d", "e")
value <- c(6, 0.75, 1, 0.75, 3)
data <- data.frame(specie, condition, value)
ggplot(data, aes(fill = condition, y = value, x = specie)) +
  geom_bar(position = "fill", stat = "identity", width = 1) +
  scale_fill_manual(values = c("blue", "white", "red", "white", "blue")) +
  coord_flip() +
  theme_void()
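While we wait, here is one hedged sketch of the Icelandic flag (my own attempt, not the post's method): build it from geom_rect() rectangles, drawing the blue field first, then the white cross, then the red cross on top. The 25 x 18 proportions and the hex codes are my assumptions about the real flag, not something from the original code.

```r
library(ggplot2)

# rows are drawn in order: blue field, white cross arms, red cross arms
ice <- data.frame(xmin = c(0,  7,  0,  8,  0),
                  xmax = c(25, 11, 25, 10, 25),
                  ymin = c(0,  0,  7,  0,  8),
                  ymax = c(18, 18, 11, 18, 10),
                  col  = c("#02529C", "white", "white", "#DC1E35", "#DC1E35"))

ggplot(ice) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = ice$col) +
  coord_fixed() +
  theme_void()
```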